Abstract

The cultural diversity of culinary practice, as illustrated by the variety of regional 
cuisines, raises the question of whether there are any general patterns that determine the 
ingredient combinations used in food today or principles that transcend individual tastes 
and recipes. We introduce a flavor network that captures the flavor compounds shared by 
culinary ingredients. Western cuisines show a tendency to use ingredient pairs that share 
many flavor compounds, supporting the so-called food pairing hypothesis. By contrast, East
Asian cuisines tend to avoid compound-sharing ingredients. Given the increasing availability
of information on food preparation, our data-driven investigation opens new avenues
towards a systematic understanding of culinary practice.

As omnivores, humans have historically faced the difficult task of identifying and gathering
food that satisfies nutritional needs while avoiding foodborne illnesses [1]. This process
has contributed to the current diet of humans, which is influenced by factors ranging from an
evolved preference for sugar and fat to palatability, nutritional value, culture, ease of
production, and climate [1, 2, 3, 4, 5, 6, 7, 8, 9]. The relatively small number of recipes in use
($\sim 10^6$, e.g. http://cookpad.com) compared to the enormous number of potential recipes
($> 10^{15}$, see Supplementary Information Sec. S1.2), together with the frequent recurrence of
particular combinations in various regional cuisines, indicates that we are exploiting but a tiny
fraction of the potential combinations. Although this pattern itself can be explained by a simple
evolutionary model [10] or data-driven approaches [11], a fundamental question still remains: are
there any quantifiable and reproducible principles behind our choice of certain ingredient
combinations and avoidance of others?

Although many factors such as colors, texture, temperature, and sound play an important 
role in food sensation [12, 13, 14, 15], palatability is largely determined by flavor, representing
a group of sensations including odors (due to molecules that can bind olfactory receptors), tastes
(due to molecules that stimulate taste buds), and freshness or pungency (trigeminal senses) [16].
Therefore, the flavor compound (chemical) profile of the culinary ingredients is a natural
starting point for a systematic search for principles that might underlie our choice of acceptable
ingredient combinations.

A hypothesis, which over the past decade has received attention among some chefs and food 
scientists, states that ingredients sharing flavor compounds are more likely to taste well together 
than ingredients that do not [17]. This food pairing hypothesis has been used to search for novel
ingredient combinations and has prompted, for example, some contemporary restaurants to 
combine white chocolate and caviar, as they share trimethylamine and other flavor compounds, 
or chocolate and blue cheese that share at least 73 flavor compounds. As we search for evidence 
supporting (or refuting) any 'rules' that may underlie our recipes, we must bear in mind that the 
scientific analysis of any art, including the art of cooking, is unlikely to be capable of explaining 
every aspect of the artistic creativity involved. Furthermore, there are many ingredients whose
main role in a recipe may not be only flavoring but something else as well (e.g. eggs' role in
ensuring mechanical stability or paprika's role in adding vivid colors). Finally, the flavor of a dish
owes as much to the mode of preparation as to the choice of particular ingredients [12, 18, 19].
However, our hypothesis is that given the large number of recipes we use in our analysis
(56,498), such confounding factors can be systematically filtered out, allowing for the discovery
of patterns that may transcend specific dishes or ingredients.

Here we introduce a network-based approach to explore the impact of flavor compounds on
ingredient combinations. Efforts by food chemists to identify the flavor compounds contained
in most culinary ingredients allow us to link each ingredient to 51 flavor compounds on
average [20]¹. We build a bipartite network [21, 22, 23, 24, 25, 26] consisting of two different
types of nodes: (i) 381 ingredients used in recipes throughout the world, and (ii) 1,021 flavor
compounds that are known to contribute to the flavor of each of these ingredients (Fig. 1A).
A projection of this bipartite network is the flavor network in which two nodes (ingredients)
are connected if they share at least one flavor compound (Fig. 1B). The weight of each link
represents the number of shared flavor compounds, turning the flavor network into a weighted
network [27, 22, 23]. While the compound concentration in each ingredient and the detection
threshold of each compound should ideally be taken into account, the lack of systematic data
prevents us from exploring their impact (see Sec. S1.1.2 on data limitations).
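
The projection described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `project_flavor_network` and the toy ingredient and compound labels are hypothetical, and the only assumption taken from the text is that the link weight equals the number of shared flavor compounds.

```python
from itertools import combinations


def project_flavor_network(ingredient_compounds):
    """Project an ingredient -> set-of-compounds mapping onto ingredients.

    Returns {(a, b): weight}, where weight is the number of flavor
    compounds shared by ingredients a and b. Pairs that share no
    compound are not linked.
    """
    weights = {}
    for a, b in combinations(sorted(ingredient_compounds), 2):
        shared = len(ingredient_compounds[a] & ingredient_compounds[b])
        if shared > 0:
            weights[(a, b)] = shared
    return weights


# Toy data (hypothetical compound assignments, for illustration only).
toy = {
    "ingredient_a": {"c1", "c2", "c3"},
    "ingredient_b": {"c2", "c3", "c4"},
    "ingredient_c": {"c5"},
}
network = project_flavor_network(toy)
print(network)
```

In this toy example only the first two ingredients are linked, because the third shares no compound with either of them.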

Since several flavor compounds are shared by a large number of ingredients, the resulting
flavor network is too dense for direct visualization (average degree $\langle k \rangle \approx 214$). We therefore
use a backbone extraction method [28, 29] to identify the statistically significant links for each
ingredient given the sum of weights characterizing the particular node (Fig. 2; see SI for
details). Not surprisingly, each module in the network corresponds to a distinct food class such as
meats (red) or fruits (yellow). The links between modules inform us of the flavor compounds
that hold different classes of foods together. For instance, fruits and dairy products are close to
alcoholic drinks, and mushrooms appear isolated, as they share a statistically significant number
of flavor compounds only with other mushrooms.

1 While finalizing this manuscript, an updated edition (6th Ed.) of Fenaroli's handbook of flavor ingredients has
been released.

The flavor network allows us to reformulate the food pairing hypothesis as a topological 
property: do we more frequently use ingredient pairs that are strongly linked in the flavor
network or do we avoid them? To test this hypothesis we need data on ingredient combinations
preferred by humans, information readily available in the current body of recipes. For
generality, we used 56,498 recipes provided by two American repositories (epicurious.com and
allrecipes.com) and, to avoid a distinctly Western interpretation of the world's cuisine, we also
used a Korean repository (menupan.com) (Fig. 1). The recipes are grouped into geographically
distinct cuisines (North American, Western European, Southern European, Latin American, and 
East Asian; see Table S2). The average number of ingredients used in a recipe is around eight, 
and the overall distribution is bounded (Fig. 1C), indicating that recipes with a very large or 
very small number of ingredients are rare. By contrast, the popularity of specific ingredients 
varies over four orders of magnitude, documenting huge differences in how frequently various 
ingredients are used in recipes (Fig. 1D), as observed in [10]. For example, jasmine tea,
Jamaican rum, and 14 other ingredients are each found in only a single recipe (see SI Sec. S1.2), but
egg appears in as many as 20,951, more than one third of all recipes.

Results 

Figure 3D indicates that North American and Western European cuisines exhibit a statistically
significant tendency towards recipes whose ingredients share flavor compounds. By contrast,
East Asian and Southern European cuisines avoid recipes whose ingredients share flavor
compounds (see Fig. 3D for the Z-score, capturing the statistical significance of $\Delta N_s$). The
systematic difference between the East Asian and the North American recipes is particularly clear if we
inspect the $P(N_s^{\mathrm{rand}})$ distribution of the randomized recipe dataset, compared to the observed
number of shared compounds characterizing the two cuisines, $N_s$. This distribution reveals that
North American dishes use far more compound-sharing pairs than expected by chance (Fig. 3E),
and the East Asian dishes far fewer (Fig. 3F). Finally, we generalize the food pairing hypothesis
by exploring if ingredient pairs sharing more compounds are more likely to be used in specific
cuisines. The results largely correlate with our earlier observations: in North American recipes,
the more compounds are shared by two ingredients, the more likely they are to appear together
in recipes. By contrast, in East Asian cuisine the more flavor compounds two ingredients share,
the less likely they are to be used together (Fig. 3G and 3H; see SI for details and results on
other cuisines).

What is the mechanism responsible for these differences? That is, does Fig. 3C through
H imply that all recipes aim to pair ingredients together that share (North America) or do not
share (East Asia) flavor compounds, or could we identify some compounds responsible for the
bulk of the observed effect? We therefore measured the contribution $\chi_i$ of each ingredient to
the shared compound effect in a given cuisine $c$, quantifying to what degree its presence affects
the magnitude of $\Delta N_s$.

In Fig. 3I,J we show as a scatter plot $\chi_i$ (horizontal axis) and the frequency $f_i$ for each
ingredient in North American and East Asian cuisines. The vast majority of the ingredients lie
on the $\chi_i = 0$ axis, indicating that their contribution to $\Delta N_s$ is negligible. Yet, we observe
a few frequently used outliers, which tend to be in the positive $\chi_i$ region for North American
cuisine, and lie predominantly in the negative region for East Asian cuisine. This suggests that
the food pairing effect is due to a few outliers that are frequently used in a particular cuisine,
e.g. milk, butter, cocoa, vanilla, cream, and egg in North America, and beef, ginger, pork,
cayenne, chicken, and onion in East Asia. Support for the definitive role of these ingredients is
provided in Fig. 3K,L, where we removed the ingredients in order of their positive (or negative)
contributions to $\Delta N_s$ in the North American (or East Asian) cuisine, finding that the Z-score,
which measures the significance of the shared compound hypothesis, drops below two after
the removal of only 13 (5) ingredients from North American (or East Asian) cuisine (see SI
S2.2.2). Note, however, that these ingredients play a disproportionate role in the cuisine under
consideration: for example, the 13 key ingredients contributing to the shared compound effect
in North American cuisine appear in 74.4% of all recipes.

According to an empirical view known as "the flavor principle" [30], the differences
between regional cuisines can be reduced to a few key ingredients with specific flavors: adding
soy sauce to a dish almost automatically gives it an oriental taste because Asians use soy sauce
widely in their food and other ethnic groups do not; by contrast, paprika, onion, and lard are a
signature of Hungarian cuisine. Can we systematically identify the ingredient combinations
responsible for the taste palette of a regional cuisine? To answer this question, we measure
the authenticity of each ingredient ($p_i^c$), ingredient pair ($p_{ij}^c$), and ingredient triplet ($p_{ijk}^c$) (see
Materials and Methods). In Fig. 4 we organize the six most authentic single ingredients,
ingredient pairs and triplets for North American and East Asian cuisines in a flavor pyramid. The
rather different ingredient classes (as reflected by their color) in the two pyramids capture the
differences between the two cuisines: North American food heavily relies on dairy products,
eggs and wheat; by contrast, East Asian cuisine is dominated by plant derivatives like soy sauce,
sesame oil, rice, and ginger. Finally, the two pyramids also illustrate the different affinities
of the two regional cuisines towards food pairs with shared compounds. The most authentic
ingredient pairs and triplets in the North American cuisine share multiple flavor compounds,
indicated by black links, but such compound-sharing links are rare among the most authentic
combinations in East Asian cuisine.

The reliance of regional cuisines on a few authentic ingredient combinations allows us to
explore the ingredient-based relationship (similarity or dissimilarity) between various regional
cuisines. For this we selected the six most authentic ingredients and ingredient pairs in each
regional cuisine (i.e. those shown in Fig. 4A,B), generating a diagram that illustrates the
ingredients shared by various cuisines, as well as singling out those that are unique to a particular
region (Fig. 4C). We once again find a close relationship between North American and Western
European cuisines and observe that, when it comes to its signature ingredient combinations,
Southern European cuisine is much closer to Latin American than Western European cuisine
(Fig. 4C).

Discussion 

Our work highlights the limitations of the recipe data sets currently available, and more
generally of the systematic analysis of food preparation data. By comparing two editions of the same
dataset with significantly different coverage, we can show that our results are robust against data
incompleteness (see SI Sec. S1.1.2). Yet, better compound databases, mitigating the incompleteness
and the potential biases of the current data, could significantly improve our understanding of
food. There is inherent ambiguity in the definition of a particular regional or ethnic cuisine.
However, as discussed in SI Sec. S1.2, the correlation between different datasets, representing two
distinct perspectives on food (American and Korean), indicates that people with different
ethnic backgrounds have a rather consistent view on the composition of various regional cuisines.

Recent work by Kinouchi et al. [10] observed that the frequency-rank plots of ingredients
are invariant across four different cuisines, exhibiting a shape that can be well described by a
Zipf-Mandelbrot curve. Based on this observation, they model the evolution of recipes by
assuming a copy-mutate process, leading to a very similar frequency-rank curve. The copy-mutate
model provides an explanation for how an ingredient becomes a staple ingredient of a cuisine:
namely, by having a high fitness value or by being a founder. The model assigns each ingredient a
random fitness value, which represents the ingredient's nutritional value, availability, flavor,
etc. For example, it has been suggested that each culture eagerly adopts spices that have high
anti-bacterial activity (e.g. garlic) [31], spices considered to have high fitness. The mutation
phase of the model replaces less fit ingredients with more fit ones. Meanwhile, the copy
mechanism keeps copying the founder ingredients (ingredients in the early recipes) and makes them
abundant in the recipes regardless of their fitness value.

It is worthwhile to discuss the similarities and differences between the quantities we measured
and the concepts of fitness and founders. First of all, prevalence ($P_i^c$) and authenticity ($p_i^c$) are
empirically measured values while fitness is an intrinsic hidden variable. Among the list of
highly prevalent ingredients we indeed find old ingredients (founders) that have been used in
the same geographic region for thousands of years. At the same time, there are relatively new
ingredients such as tomatoes, potatoes, and peppers that were introduced to Europe and Asia
just a few hundred years ago. These new but prevalent ingredients can be considered to have
high fitness values. If an ingredient has a high level of authenticity, then it is prevalent in a
cuisine while not so prevalent in all other cuisines.

Indeed, each culture has developed its own authentic ingredients. This may indicate that
fitness can vary greatly across cuisines or that the stochasticity of recipe evolution causes the
recipes in different regions to diverge into completely different sets. More historical investigation
will help us to estimate the fitness of ingredients and assess why we use the particular ingredients
we currently do. The higher-order fitness value suggested in [10] is very close to our concept of
food pairing affinity.

Another difference in our results is the number of ingredients in recipes. Kinouchi et al. 
reported that the average number of ingredients per recipe varies across different cookbooks. 
While we also observed variation in the number of ingredients per recipe, the patterns we found 
were not consistent with those found by Kinouchi et al. For instance, the French cookbook 
has more ingredients per recipe than a Brazilian one, but in our dataset we find the opposite
result. We believe that a cookbook cannot represent a whole cuisine, and that cookbooks with
more sophisticated recipes will tend to have more ingredients per recipe than cookbooks with
everyday recipes. As more complete datasets become available, sharper conclusions can be
drawn regarding the size variation between cuisines.

Our contribution in this context is a study of the role that flavor compounds play in
determining these fitness values. One possible interpretation of our results is that shared flavor
compounds represent one of several contributions to fitness value, and that, while shared
compounds clearly play a significant role in some cuisines, other contributions may play a more
dominant role in other cuisines. The fact that recipes rely on ingredients not only for flavor but
also to provide the final textures and overall structure of a given dish provides support for the
idea that fitness values depend on a multitude of ingredient characteristics besides their flavor
profile.

In summary, our network-based investigation identifies a series of statistically significant 
patterns that characterize the way humans choose the ingredients they combine in their food. 
These patterns manifest themselves to varying degrees in different geographic regions: while
North American and Western European dishes tend to combine ingredients that share flavor 
compounds, East Asian cuisine avoids them. More generally this work provides an example 
of how the data-driven network analysis methods that have transformed biology and the social 
sciences in recent years can yield new insights in other areas, such as food science. 

Methods 
Shared compounds 

To test the hypothesis that the choice of ingredients is driven by an appreciation for ingredient
pairs that share flavor compounds (i.e. those linked in Fig. 2), we measured the mean number
of shared compounds in each recipe, $N_s$, comparing it with $N_s^{\mathrm{rand}}$ obtained for a randomly
constructed reference recipe dataset. For a recipe $R$ that contains $n_R$ different ingredients, where
each ingredient $i$ has a set of flavor compounds $C_i$, the mean number of shared compounds

$$N_s(R) = \frac{2}{n_R (n_R - 1)} \sum_{i, j \in R,\ i < j} |C_i \cap C_j|$$

is zero if none of the ingredient pairs $(i, j)$ in the recipe share any flavor compounds. For
example, the 'mustard cream pan sauce' recipe contains chicken broth, mustard, and cream, none
of which share any flavor compounds ($N_s(R) = 0$) in our dataset. Yet, $N_s(R)$ can reach as high
as 60 for 'sweet and simple pork chops', a recipe containing apple, pork, and cheddar cheese
(see Fig. 3A). To check whether recipes with high $N_s(R)$ are statistically preferred (implying
the validity of the shared compound hypothesis) in a cuisine $c$ with $N_c$ recipes, we calculate
$\Delta N_s = N_s^{\mathrm{real}} - N_s^{\mathrm{rand}}$, where 'real' and 'rand' indicate real recipes and randomly constructed
recipes, respectively, and $N_s = \sum_R N_s(R) / N_c$ (see SI for details of the randomization
process). This random reference (null model) controls for the frequency of a particular ingredient
in a given regional cuisine, hence our results are not affected by historical, geographical, and
climate factors that determine ingredient availability (see SI Sec. S1.1.2).
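
The quantities defined in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the null model shown here (keep each recipe's size and draw distinct ingredients with probability proportional to their usage frequency in the cuisine) is only one plausible reading of a frequency-conserving randomization; the actual procedure is described in the SI.

```python
import random
from itertools import combinations


def mean_shared_compounds(recipe, compounds):
    """N_s(R): mean number of shared compounds over all ingredient pairs."""
    ingredients = sorted(set(recipe))
    n = len(ingredients)
    if n < 2:
        return 0.0
    total = sum(len(compounds[a] & compounds[b])
                for a, b in combinations(ingredients, 2))
    return 2.0 * total / (n * (n - 1))


def cuisine_mean(recipes, compounds):
    """N_s: average of N_s(R) over the recipes of one cuisine."""
    return sum(mean_shared_compounds(r, compounds) for r in recipes) / len(recipes)


def random_recipes(recipes, rng):
    """Assumed null model: same recipe sizes, ingredients drawn by frequency."""
    pool = [ing for r in recipes for ing in set(r)]  # one entry per usage
    result = []
    for r in recipes:
        size = len(set(r))
        chosen = set()
        while len(chosen) < size:  # redraw until `size` distinct ingredients
            chosen.add(rng.choice(pool))
        result.append(sorted(chosen))
    return result


def delta_ns(recipes, compounds, n_samples=100, seed=0):
    """Delta N_s = N_s(real) - N_s(rand), averaged over random datasets."""
    rng = random.Random(seed)
    real = cuisine_mean(recipes, compounds)
    rand = sum(cuisine_mean(random_recipes(recipes, rng), compounds)
               for _ in range(n_samples)) / n_samples
    return real - rand


# Toy compound sets (hypothetical labels, for illustration only).
toy_compounds = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 9}}
print(mean_shared_compounds(["a", "b", "c"], toy_compounds))
```

For the toy recipe, the three pairs share 2, 1, and 1 compounds, so the mean is 4/3. A positive value returned by `delta_ns` would correspond to a preference for compound-sharing pairs relative to the assumed null model, a negative value to avoidance.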

Contribution 

The contribution $\chi_i$ of each ingredient to the shared compound effect in a given cuisine $c$,
quantifying to what degree its presence affects the magnitude of $\Delta N_s$, is defined by

where $f_i$ represents the number of occurrences of ingredient $i$. An ingredient's contribution is
positive (negative) if it increases (decreases) $\Delta N_s$.

Authenticity 

We define the prevalence $P_i^c$ of each ingredient $i$ in a cuisine $c$ as $P_i^c = n_i^c / N_c$, where $n_i^c$ is
the number of recipes that contain the particular ingredient $i$ in the cuisine and $N_c$ is the total
number of recipes in the cuisine. The relative prevalence $p_i^c = P_i^c - \langle P_i^{c'} \rangle_{c' \neq c}$ measures the
authenticity: the difference between the prevalence of $i$ in cuisine $c$ and the average prevalence
of $i$ in all other cuisines. We can also identify ingredient pairs or triplets that are overrepresented
in a particular cuisine relative to other cuisines by defining the relative pair prevalences
$p_{ij}^c = P_{ij}^c - \langle P_{ij}^{c'} \rangle_{c' \neq c}$ and triplet prevalences $p_{ijk}^c = P_{ijk}^c - \langle P_{ijk}^{c'} \rangle_{c' \neq c}$, with
$P_{ij}^c = n_{ij}^c / N_c$ and $P_{ijk}^c =$
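
As a minimal sketch of the single-ingredient case above (hypothetical function names and toy recipes; each recipe is assumed to be represented as a set of ingredient names), prevalence and relative prevalence could be computed as follows.

```python
def prevalence(recipes, ingredient):
    """P_i^c: fraction of a cuisine's recipes that contain the ingredient."""
    return sum(1 for r in recipes if ingredient in r) / len(recipes)


def relative_prevalence(cuisines, cuisine, ingredient):
    """p_i^c: prevalence in `cuisine` minus the mean prevalence elsewhere."""
    own = prevalence(cuisines[cuisine], ingredient)
    others = [prevalence(recipes, ingredient)
              for name, recipes in cuisines.items() if name != cuisine]
    return own - sum(others) / len(others)


# Toy cuisines (hypothetical recipes, for illustration only).
toy_cuisines = {
    "X": [{"soy sauce", "rice"}, {"soy sauce"}],
    "Y": [{"butter"}, {"butter", "soy sauce"}],
    "Z": [{"butter"}, {"rice"}],
}
authenticity = relative_prevalence(toy_cuisines, "X", "soy sauce")
print(authenticity)
```

Here the ingredient appears in every recipe of cuisine X, in half of those of Y, and in none of Z, so its relative prevalence in X is 1.0 - 0.25 = 0.75. The pair and triplet versions follow the same pattern, with membership tests on two or three ingredients at once.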

Figure 1: Flavor network. (A) The ingredients contained in two recipes (left column), together
with the flavor compounds that are known to be present in the ingredients (right column).
Each flavor compound is linked to the ingredients that contain it, forming a bipartite network. 
Some compounds (shown in boldface) are shared by multiple ingredients. (B) If we project the 
ingredient-compound bipartite network into the ingredient space, we obtain the flavor network, 
whose nodes are ingredients, linked if they share at least one flavor compound. The thickness 
of links represents the number of flavor compounds two ingredients share and the size of each 
circle corresponds to the prevalence of the ingredients in recipes. (C) The distribution of recipe 
size, capturing the number of ingredients per recipe, across the five cuisines explored in our 
study. (D) The frequency-rank plot of ingredients across the five cuisines shows an approximately
invariant distribution across cuisines.

Figure 4: Flavor principles. (A,B) Flavor pyramids for North American and East Asian
cuisines. Each flavor pyramid shows the six most authentic ingredients (i.e. those with the
largest $p_i^c$), ingredient pairs (largest $p_{ij}^c$), and ingredient triplets (largest $p_{ijk}^c$). The size of the
nodes reflects the abundance $P_i^c$ of the ingredient in the recipes of the particular cuisine. Each
color represents the category of the ingredient (see Fig. 2 for the color) and link thickness
indicates the number of shared compounds. (C) The six most authentic ingredients and ingredient
pairs used in specific regional cuisine. Node color represents cuisine and the link weight reflects
the relative prevalence $p_{ij}^c$ of the ingredient pair.

S1 Materials and methods

S1.1 Flavor network

S1.1.1 Ingredient-compounds bipartite network

The starting point of our research is Fenaroli's handbook of flavor ingredients (fifth edition [1]),
which offers a systematic list of flavor compounds and their natural occurrences (food
ingredients). Two post-processing steps were necessary to make the dataset appropriate for our
research: (A) In many cases, the book lists the essential oil or extract instead of the ingredient
itself. Since these are physically extracted from the original ingredient, we associated the flavor
compounds in the oils and extracts with the original ingredient. (B) Another post-processing
step is including the flavor compounds of a more general ingredient into a more specific
ingredient. For instance, the flavor compounds in 'meat' can be safely assumed to also be in 'beef'
or 'pork'. 'Roasted beef' contains all flavor compounds of 'beef' and 'meat'.

The ingredient-compound association extracted from [1] forms a bipartite network. As the
name suggests, a bipartite network consists of two types of nodes, with connections only
between nodes of different types. Well-known examples of bipartite networks include collaboration
networks of scientists [2] (with scientists and publications as nodes) and actors [3] (with
actors and films as nodes), or the human disease network [4], which connects health disorders
and disease genes. In the particular bipartite network we study here, the two types of nodes are
food ingredients and flavor compounds, and a connection signifies that an ingredient contains a
compound. The full network contains 1,107 chemical compounds and 1,531 ingredients, but only 381

Figure S1: The full flavor network. The size of a node indicates average prevalence, and the
thickness of a link represents the number of shared compounds. All edges are drawn. It is
impossible to observe individual connections or any modular structure.

Figure S2: Degree distributions of the flavor network. Degree distribution of ingredients in
the ingredient-compound network, degree distribution of flavor compounds in the
ingredient-compound network, and degree distribution of the (projected) ingredient network, from left to
right. Top: degree distribution. Bottom: complementary cumulative distribution. The line and
the exponent value in the leftmost figure at the bottom are purely a visual guide.

                              3rd ed.    5th ed.
# of ingredients                  916       1507
# of compounds                    629       1107
# of edges in I-C network        6672      36781

Table S1: The basic statistics on two different datasets. The 5th edition of Fenaroli's handbook
contains much more information than the third edition.



ingredients appear in recipes, together containing 1,021 compounds (see Fig. S1). We project
this network into a weighted network between ingredients only [5, 6, 7, 8]. The weight of
each edge $w_{ij}$ is the number of compounds shared between the two nodes (ingredients) $i$ and
$j$, so that the relationship between the $M \times M$ weighted adjacency matrix $w_{ij}$ and the $M \times N$
bipartite adjacency matrix $a_{ik}$ (for ingredient $i$ and compound $k$) is given by:

$$w_{ij} = \sum_{k=1}^{N} a_{ik} a_{jk} \qquad \text{(S3)}$$

The degree distributions of ingredients and compounds are shown in Fig. S2.

S1.1.2 Incompleteness of data and the third edition

The situation encountered here is similar to the one encountered in systems biology: we do not
have a complete database of all protein, regulatory and metabolic interactions that are present
in the cell. In fact, the existing protein interaction data covers less than 10% of all protein
interactions estimated to be present in the human cell [9].

To test the robustness of our results against the incompleteness of data, we have performed
the same calculations for the 3rd edition of Fenaroli's handbook as well. The 5th edition
contains approximately six times more information on the chemical contents of ingredients
(Table S1). Yet, our main result is robust (Fig. S3), further supporting that data incompleteness is
not the main factor behind our findings.

Figure S3: Comparing the third and fifth editions of Fenaroli's handbook to see if incomplete data
impacts our conclusions. The much sparser data of the 3rd edition (Top) shows a very similar trend
to that of the 5th edition (Bottom, repeated from main text Fig. 3). Given the huge difference
between the two editions (Table S1), this further supports that the observed patterns are robust.

Figure S4: The backbone of the ingredient network extracted according to [10] with a
significance threshold $p = 0.04$. Color indicates food category, font size reflects ingredient prevalence
in the dataset, and link thickness represents the number of shared compounds between two
ingredients.



S1.1.3 Extracting the backbone

The network's average degree is about 214 (while the number of nodes is 381). It is very dense
and thus hard to visualize (see Fig. S1). To circumvent this high density, we use a method that
extracts the backbone of a weighted network [10], along with the method suggested in [11].
For each node, we keep those edges whose weight is statistically significant given the strength
(sum of weights) of the node. If there is none, we keep the edge with the largest weight. A
different visualization of this backbone is presented in Fig. S4. Ingredients are grouped into
categories and the size of the name indicates the prevalence. This representation clearly shows
the categories that are closely connected.
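
The per-node rule described above can be sketched as follows. The text does not give the significance formula, so this sketch assumes a disparity-filter style test: for a node with $k$ links and strength $s$, a link of weight $w$ gets the p-value $(1 - w/s)^{k-1}$, i.e. the probability of a share at least that large if the strength were split uniformly at random among the $k$ links. The function name `extract_backbone` and the toy weights are hypothetical; the default threshold of 0.04 is the value quoted in the caption of Fig. S4.

```python
def extract_backbone(weights, alpha=0.04):
    """Keep links that are significant for at least one endpoint.

    weights: {(a, b): w} for an undirected weighted network.
    Returns a set of (a, b) tuples with a < b.
    """
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w

    kept = set()
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        strength = sum(nbrs.values())
        significant = []
        for other, w in nbrs.items():
            # Assumed null model: strength split uniformly among k links.
            p_value = (1.0 - w / strength) ** (k - 1) if k > 1 else 1.0
            if p_value < alpha:
                significant.append(other)
        if not significant:
            # Fallback from the text: keep the link with the largest weight.
            significant = [max(nbrs, key=nbrs.get)]
        for other in significant:
            kept.add(tuple(sorted((node, other))))
    return kept


# Toy weighted network (hypothetical weights, for illustration only).
toy_weights = {("a", "h"): 100, ("b", "h"): 1, ("c", "h"): 1, ("b", "c"): 50}
backbone = extract_backbone(toy_weights)
print(sorted(backbone))
```

In the toy network the two weak links to the hub are dropped, while the dominant link of each node survives, so no node becomes isolated.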

S1.1.4 Sociological bias

Western scientists have been leading food chemistry, which may imply that Western ingredients
are more studied. To check if such a bias is present in our dataset, we first made two lists
of ingredients: one is the list of ingredients appearing in North American cuisine, sorted by
the relative prevalence $p_i^c$ (i.e. the ingredients more specific to North American cuisine come
first). The other is a similar list for East Asian cuisine. Then we measured the number of flavor
compounds for ingredients in each list. The result in Fig. S5A shows that any potential bias, if
present, is not significant.

There is another possibility, however: if there is a bias such that the dataset tends to list more
familiar (Western) ingredients for more common flavor compounds, then we should observe a
correlation between the familiarity (how frequently an ingredient is used in Western cuisine) and
the degree of the compound (the number of ingredients it appears in). Figure S5B shows no
observable correlation, however.

SI. 2 Recipes 

[Figure S5 plots: one panel per cuisine (recovered panel titles: Southern European, East Asian); x-axis: ingredient rank.] 

Figure S5: Are popular, much-used ingredients more studied than less frequent foods, leading 
to potential systematic bias? (A) We plot the number of flavor compounds for each ingredient as 
a function of the (ranked) popularity of the ingredient. The correlation is very small compared 
to the large fluctuations present. Ingredients mainly used in North American or Latin American 
cuisine show a weak tendency to have more odorants, but the correlations are weak (with 
coefficients of -0.13 and -0.10, respectively). A linear regression line is shown only if the 
corresponding p-value is smaller than 0.05. (B) If there is bias such that the book tends to list 
more familiar ingredients for more common flavor compounds, then we should observe a 
correlation between the familiarity (how frequently an ingredient is used in the cuisine) and the 
degree of the compound in the ingredient-compound network. The plots show no observable 
correlations for any cuisine. 

The number of potential ingredient combinations is enormous. For instance, one could generate 
$\sim 10^{15}$ distinct ingredient combinations by choosing eight ingredients (the current average 
per recipe) from the approximately 300 ingredients in our dataset. If we use the numbers reported 
in Kinouchi et al. [12] (1000 ingredients and 10 ingredients per recipe), one can generate 
$\sim 10^{23}$ ingredient combinations. This number increases greatly if we also consider the 
various cooking methods. Regardless, the fact that this number exceeds by many orders of magnitude 
the $\sim 10^{6}$ recipes listed in the largest recipe repositories (e.g. http://cookpad.com) 
indicates that humans are exploiting only a tiny fraction of the culinary space. 
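
The combination counts quoted above can be checked directly with binomial coefficients. The short Python sketch below assumes that a recipe is simply an unordered set of distinct ingredients, ignoring quantities and cooking methods; the helper name `order_of_magnitude` is ours and serves illustration only.

```python
import math


def order_of_magnitude(n: int) -> int:
    """Return floor(log10(n)) for a positive integer, using its digit count."""
    return len(str(n)) - 1


# Choose 8 ingredients out of roughly 300 (the size of our dataset).
ours = math.comb(300, 8)
# Choose 10 ingredients out of 1000 (the numbers reported by Kinouchi et al.).
kinouchi = math.comb(1000, 10)

print(f"C(300, 8)   ~ 10^{order_of_magnitude(ours)}")
print(f"C(1000, 10) ~ 10^{order_of_magnitude(kinouchi)}")
```

Exact integer arithmetic is used so that no floating-point rounding enters the comparison with the roughly one million recipes available online.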

We downloaded all available recipes from three websites: allrecipes.com, epicurious.com, 
and menupan.com. Recipes tagged as belonging to an ethnic cuisine were extracted and then 
grouped into 11 larger regional groups. We used only the 5 groups that each contain more than 
1,000 recipes (see Table S2). In the curation process, we built a replacement dictionary covering 
frequently used phrases that should be discarded, synonyms for ingredients, complex ingredients 
that should be broken into their constituent ingredients, and so forth. We used this dictionary to 
automatically extract the list of ingredients for each recipe. As shown in Fig. 1D, the usage of 
ingredients is highly heterogeneous. Egg, wheat, butter, onion, garlic, milk, vegetable oil, and 
cream each appear in more than 10,000 recipes, while geranium, roasted hazelnut, durian, muscat 
grape, roasted pecan, roasted nut, mate, jasmine tea, jamaican rum, angelica, sturgeon caviar, 
beech, lilac flower, strawberry jam, and emmental cheese each appear in only one recipe. Table S3 
shows the correlation between ingredient usage frequency in each cuisine and in each dataset. 
Figure S6 shows that the three datasets qualitatively agree with each other, which provides a 
basis for combining them. 
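
The curation and comparison steps described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline we actually ran: the dictionary entries, the function names, and the sample ingredient lines are all hypothetical, and we assume a plain Pearson coefficient for the usage correlation, since the text does not name the correlation measure.

```python
import math
import re
from collections import Counter

# Hypothetical replacement dictionary; a real one would be far larger.
DISCARD = {"cup", "cups", "tbsp", "tsp", "chopped", "diced", "fresh", "of"}
SYNONYMS = {"scallion": "green onion", "eggs": "egg"}
COMPLEX = {"mayonnaise": ["egg", "vegetable oil", "vinegar"]}


def extract_ingredients(lines):
    """Map raw ingredient lines to a sorted list of canonical ingredient names."""
    found = set()
    for line in lines:
        words = re.findall(r"[a-z]+", line.lower())   # drops numbers and punctuation
        name = " ".join(w for w in words if w not in DISCARD)
        if not name:
            continue
        name = SYNONYMS.get(name, name)               # merge synonyms
        found.update(COMPLEX.get(name, [name]))       # split complex ingredients
    return sorted(found)


def usage_frequency(recipes):
    """Fraction of recipes in which each ingredient appears."""
    counts = Counter(i for recipe in recipes for i in set(recipe))
    return {i: c / len(recipes) for i, c in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def usage_correlation(recipes_a, recipes_b):
    """Correlate ingredient usage between two datasets of the same cuisine."""
    fa, fb = usage_frequency(recipes_a), usage_frequency(recipes_b)
    vocab = sorted(set(fa) | set(fb))
    return pearson([fa.get(i, 0.0) for i in vocab], [fb.get(i, 0.0) for i in vocab])


recipe = extract_ingredients(
    ["2 cups chopped Scallion", "1 tbsp mayonnaise", "3 fresh eggs"]
)
print(recipe)
```

Applying `usage_correlation` to the recipes of one cuisine taken from two different websites gives one number per cuisine and dataset pair, which is the kind of quantity tabulated in Table S3.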

S1.2.1 Size of recipes 



We report the size of the recipes for each cuisine in Table S4. Overall, the mean number of 
ingredients per recipe is smaller than that reported in Kinouchi et al. [12]. We believe that this 
is mainly due to the different types of data sources. There are various types of recipes, from 
quick meals to the sophisticated dishes of expensive restaurants; likewise, there are also 
various cookbooks. The number of ingredients may therefore vary a lot between recipe datasets. If 
a book focuses on sophisticated, high-level dishes, then it will contain a richer set of 
ingredients per recipe; if a book focuses on simple home cooking recipes, then it will contain 
fewer ingredients per recipe. We believe that the online databases are closer to the latter; 
simpler recipes are likely to dominate the database because anyone can upload their own recipes. 
By contrast, we expect that cookbooks, especially the canonical ones, contain more sophisticated 
and polished recipes, which are thus more likely to contain more ingredients. 

Table S2: Number of recipes and the detailed cuisines in each regional cuisine in the recipe 
dataset. Five groups have reasonably large size. We use all cuisine data when calculating the 
relative prevalence and flavor principles. 

Cuisine set          Number of recipes   Cuisines included 
North American       41525               American, Canada, Cajun, Creole, Southern soul food, Southwestern U.S. 
Southern European    4180                Greek, Italian, Mediterranean, Spanish, Portuguese 
Latin American       2917                Caribbean, Central American, South American, Mexican 
Western European     2659                French, Austrian, Belgian, English, Scottish, Dutch, Swiss, German, Irish 
East Asian           2512                Korean, Chinese, Japanese 
Middle Eastern       645                 Iranian, Jewish, Lebanese, Turkish 
South Asian          621                 Bangladeshian, Indian, Pakistani 
Southeast Asian      457                 Indonesian, Malaysian, Filipino, Thai, Vietnamese 
Eastern European     381                 Eastern European, Russian 
African              352                 Moroccan, East African, North African, South African, West African 
Northern European    250                 Scandinavian 

Cuisine              Epicurious vs. Allrecipes   Epicurious vs. Menupan   Allrecipes vs. Menupan 
North American       0.93                        N/A                      N/A 
East Asian           0.94                        0.79                     0.82 
Western European     0.92                        0.88                     0.89 
Southern European    0.93                        0.83                     0.83 
Latin American       0.94                        0.69                     0.74 
African              0.89                        N/A                      N/A 
Eastern European     0.93                        N/A                      N/A 
Middle Eastern       0.87                        N/A                      N/A 
Northern European    0.77                        N/A                      N/A 
South Asian          0.97                        N/A                      N/A 
Southeast Asian      0.92                        N/A                      N/A 

Table S3: The correlation of ingredient usage between different datasets. We see that the 
different datasets broadly agree on what constitutes a cuisine, at least at a gross level. 

[Figure S6 plot: one group of bars per cuisine (East Asian, Southern European, Latin American, Western European, North American); legend: Epicurious, Allrecipes, Menupan (Korean).] 

Figure S6: Comparison between different datasets. The results on different datasets 
qualitatively agree with each other (except for Latin American cuisine). Note that menupan.com 
is a Korean website. 

Cuisine              Ingredients per recipe 
North American       7.96 
Western European     8.03 
Southern European    8.86 
Latin American       9.38 
East Asian           8.96 
Northern European    6.82 
Middle Eastern       8.39 
Eastern European     8.39 
South Asian          10.29 
African              10.45 
Southeast Asian      11.32 

Table S4: Average number of ingredients per recipe for each cuisine. 

Also, the pattern reported in Kinouchi et al. [12] is reversed in our dataset: Western European 
cuisine has 8.03 ingredients per recipe while Latin American cuisine has 9.38 ingredients 
per recipe. Therefore, we believe that there is no clear tendency in the number of ingredients 
per recipe between Western European and Latin American cuisine. 

Yet, there seems to be an interesting trend in our dataset: hotter countries use more 
ingredients per recipe (6.82 in Northern European vs. 11.31 in Southeast Asian cuisine), probably 
due to the use of more herbs and spices [13, 14] or due to more diverse ecosystems. Figure S7 
shows the distribution of recipe size in all cuisines. 



[Figure S7 plots: recipe-size distribution, one curve per cuisine (North American, Western European, Southern European, Latin American, East Asian, African, South Asian, Southeast Asian, Middle Eastern, Eastern European, Northern European); x-axis: number of ingredients per recipe (s).] 

Figure S7: Number of ingredients per recipe. North American and Western European cuisines 
show similar distributions, while other cuisines have slightly more ingredients per recipe. 

[Figure S8 plot: x-axis: number of duplicates, D (ticks at 1, 10, 100).] 

Figure S8: If a recipe is very popular, the recipe databases will tend to list more 
variations of it. This plot shows that there are many duplicated recipes that share the 
same set of ingredients. The number of duplicates exhibits a heavy-tailed distribution. 
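
Counting duplicates of this kind is straightforward once each recipe is reduced to its set of ingredients. The Python sketch below uses hypothetical toy recipes and assumes that the number of duplicates D of a recipe is the number of database entries sharing exactly the same ingredient set.

```python
from collections import Counter

# Hypothetical toy recipes; ingredient order differs but the sets may coincide.
recipes = [
    ["egg", "butter", "wheat"],
    ["wheat", "egg", "butter"],
    ["garlic", "onion"],
    ["butter", "egg", "wheat", "milk"],
    ["onion", "garlic"],
    ["egg", "wheat", "butter"],
]

# A frozenset ignores ingredient order, so identical sets collapse to one key.
copies = Counter(frozenset(recipe) for recipe in recipes)

# Distribution of D: how many distinct ingredient sets occur exactly D times.
distribution = Counter(copies.values())

for d in sorted(distribution):
    print(f"D = {d}: {distribution[d]} ingredient set(s)")
```

Plotting `distribution` on logarithmic axes for a full database would give the kind of curve summarized in Figure S8.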

S1.2.2 Frequency of recipes 



In contrast to previous work [12] that used published cookbooks, we use online databases. 
Although recipes online are probably less canonical than those in established cookbooks, online 
databases allow us to study much larger datasets more easily. Another important benefit of using 
online databases is that there is no real-estate issue, in contrast to physical cookbooks, which 
must carefully choose what to include. Adding a slight variation of a recipe costs a website 
virtually nothing and even enhances the quality of the database. Therefore, one can expect that 
online databases capture the frequency of recipes more accurately than cookbooks. 

Certain recipes (e.g. signature recipes of a cuisine) are much more important than others; 
they are cooked much more frequently than others. Figure S8 shows that there are many 
duplicated recipes (possessing identical sets of ingredients), indicating that popularity is 
naturally encoded in these datasets. 

S1.3 Number of shared compounds 



Figure S9 explains how to measure the number of shared compounds in a hypothetical recipe 
with three ingredients. 
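
As a concrete illustration of this measurement, the Python sketch below computes the mean number of shared compounds over all ingredient pairs of a three-ingredient recipe. The ingredient names and compound sets are hypothetical, and the function name `mean_shared_compounds` is ours.

```python
from itertools import combinations


def mean_shared_compounds(recipe, compounds):
    """Average number of shared flavor compounds over all ingredient pairs."""
    pairs = list(combinations(recipe, 2))
    if not pairs:            # a single-ingredient recipe has no pairs
        return 0.0
    shared = sum(len(compounds[a] & compounds[b]) for a, b in pairs)
    return shared / len(pairs)


# Hypothetical compound sets for three ingredients.
compounds = {
    "a": {"c1", "c2", "c3"},
    "b": {"c2", "c3", "c4"},
    "c": {"c3", "c5"},
}

# Pairs: (a, b) share 2 compounds, (a, c) share 1, (b, c) share 1.
ns = mean_shared_compounds(["a", "b", "c"], compounds)
print(ns)
```

Here the three pairs share 2, 1, and 1 compounds, so the recipe's value is 4/3.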

S1.4 Shared compounds hypothesis 

S1.4.1 Null models 



In order to test the robustness of our findings, we constructed several random recipe datasets 
using a series of appropriate null models and compared the mean number of shared compounds 
$N_s$ between the real and the randomized recipe sets. The results of these null models are 
summarized in Fig. S10, each confirming the trends discussed in the paper. The null models we 
used are: 



(A, B) Frequency-conserving. Cuisine c uses a set of $n_c$ ingredients, each with frequency $f_i$. 
        For a given recipe with $N_i$ ingredients in this cuisine, we pick $N_i$ ingredients randomly 
        from the set of all $n_c$ ingredients, according to $f_i$. That is, the more frequently an 
        ingredient is used, the more likely it is to be picked. This model preserves the prevalence 
        of each ingredient. This is the null model presented in the main text. 

(C, D) Frequency and ingredient category preserving. With this null model, we conserve 
        the category (meats, fruits, etc.) of each ingredient in the recipe, and then sample 
        random ingredients in proportion to their prevalence. For instance, a random realization of 
        a recipe with beef and onion will contain a meat and a vegetable. The probability of picking 
        an ingredient is proportional to the prevalence of the ingredient in the cuisine. 

(E, F) Uniform random. We build a random recipe by randomly choosing ingredients that 
        are used at least once in the particular cuisine. Even very rare ingredients will frequently 
        appear in random recipes. 

(G, H) Uniform random, ingredient category preserving. For each recipe, we preserve the 
        category of each ingredient, but do not consider the frequency of ingredients. 

Figure S9: For a recipe with three ingredients, we count the number of shared compounds in 
every possible pair of ingredients, and divide the total by the number of possible pairs of 
ingredients. 

Figure S10: Four different null models. Although the size of the discrepancy between cuisines 
varies greatly, the overall trend is stable. 



Although these null models greatly change the frequency and type of ingredients in the 
random recipes, North American and East Asian recipes show a robust pattern: North American 
recipes always share more flavor compounds than expected and East Asian recipes always share 
fewer flavor compounds than expected. This, together with the existence of both positive and 
negative $N_s^{\mathrm{real}} - N_s^{\mathrm{rand}}$ in every null model, indicates that the 
patterns we find are not due to a poorly selected null model. 
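
The four null models listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to produce Fig. S10: ingredients are drawn without replacement (the text does not say whether a random recipe may repeat an ingredient), the category-preserving variants keep the number of ingredients per category, and the toy recipes and categories are hypothetical.

```python
import random
from collections import Counter


def draw_distinct(pool, weights, k, rng):
    """Weighted sampling of k distinct items (assumption: no repeats)."""
    pool, weights = list(pool), list(weights)
    picked = []
    for _ in range(k):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(idx))
        weights.pop(idx)
    return picked


def random_recipe(recipe, freq, rng, category=None, uniform=False):
    """Return one randomized counterpart of `recipe`.

    freq     -- ingredient -> number of recipes using it in the cuisine
    category -- optional ingredient -> category map; if given, each category
                keeps as many ingredients as in `recipe` (models C, D, G, H)
    uniform  -- if True, ignore frequencies (models E, F, G, H)
    """
    def weight(ingredient):
        return 1.0 if uniform else float(freq[ingredient])

    if category is None:
        pool = sorted(freq)
        return draw_distinct(pool, [weight(i) for i in pool], len(recipe), rng)

    result = []
    needed = Counter(category[i] for i in recipe)
    for cat, k in sorted(needed.items()):
        pool = sorted(i for i in freq if category[i] == cat)
        result += draw_distinct(pool, [weight(i) for i in pool], k, rng)
    return result


# Hypothetical toy cuisine.
recipes = [["beef", "onion", "garlic"], ["pork", "onion"], ["beef", "carrot"]]
category = {"beef": "meat", "pork": "meat",
            "onion": "vegetable", "garlic": "vegetable", "carrot": "vegetable"}
freq = Counter(i for recipe in recipes for i in recipe)

rng = random.Random(0)
frequency_conserving = [random_recipe(r, freq, rng) for r in recipes]
category_preserving = [random_recipe(r, freq, rng, category=category) for r in recipes]
print(frequency_conserving)
print(category_preserving)
```

Comparing the mean number of shared compounds of the real recipes with that of many such randomized sets gives the difference between the real and random values for each null model.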



Finally, Fig. S11 shows the probability that a given pair with a certain number of shared 
compounds will appear in the recipes, representing the raw data behind the generalized 
food-pairing hypothesis discussed in the text. To reduce noise, we only consider values of 
$N_s$ for which there are more than five ingredient pairs. 



[Figure S11 plots: one panel per cuisine (North American, Western European, Southern European, Latin American, East Asian); x-axis: number of shared compounds; y-axis: P(n).] 

Figure S11: The probability that ingredient pairs that share a certain number of compounds also 
appear in the recipes. We enumerate every possible ingredient pair in each cuisine and show the 
fraction of pairs in recipes as a function of the number of shared compounds. To reduce noise, 
we only used data points calculated from more than 5 pairs. 

S1.4.2 Ingredient contributions 



To further investigate the contrasting results on the shared compound hypothesis for different 
cuisines, we calculate the contribution of each ingredient and ingredient pair to $\Delta N_s$. 
Since $N_s(R)$ for a recipe $R$ is defined as 

$$N_s(R) = \frac{2}{n_R (n_R - 1)} \sum_{i, j \in R,\; i < j} |C_i \cap C_j| \qquad \text{(S4)}$$

(where $n_R$ is the number of ingredients in the recipe $R$), the contribution from an ingredient 
pair can be calculated as follows: 



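$$\chi_{ij}^{c} = \left( \frac{1}{N_c} \sum_{R \ni i, j} \frac{2}{n_R (n_R - 1)} |C_i \cap C_j| \right) - \left( \frac{f_i f_j}{N_c^2} \, \frac{2}{\langle n_R \rangle (\langle n_R \rangle - 1)} |C_i \cap C_j| \right) \qquad \text{(S5)}$$

where $f_i$ indicates the ingredient $i$'s number of occurrences. Similarly, the individual 
contribution can be calculated: 

$$\chi_{i}^{c} = \left( \frac{1}{N_c} \sum_{R \ni i} \frac{2}{n_R (n_R - 1)} \sum_{j \neq i \,(j, i \in R)} |C_i \cap C_j| \right) - \left( \frac{2 f_i}{N_c \langle n_R \rangle} \, \frac{\sum_{j \in c} f_j |C_i \cap C_j|}{\sum_{j \in c} f_j} \right) \qquad \text{(S6)}$$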
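
As an illustration of the individual contribution, the Python sketch below computes a per-ingredient score under one reading of Eq. (S6). The observed term sums, over the recipes that contain the ingredient, its share of that recipe's mean number of shared compounds and divides by the number of recipes in the cuisine. The expected term replaces the co-occurring ingredients by a frequency-weighted average over the whole cuisine. We assume here that $N_c$ is the number of recipes in cuisine $c$, that $\langle n_R \rangle$ is the mean recipe size, and that $f_i$ counts the recipes containing ingredient $i$; the toy recipes and compound sets are hypothetical.

```python
from collections import Counter


def individual_contributions(recipes, compounds):
    """Per-ingredient contribution: observed sharing minus expected sharing."""
    n_recipes = len(recipes)                                   # N_c (assumed meaning)
    mean_size = sum(len(set(r)) for r in recipes) / n_recipes  # <n_R>
    freq = Counter(i for r in recipes for i in set(r))         # f_i
    total_freq = sum(freq.values())

    chi = {}
    for i in freq:
        # Observed term: ingredient i's share in every recipe that contains it.
        observed = 0.0
        for r in recipes:
            members = set(r)
            n = len(members)
            if i in members and n > 1:
                shared = sum(len(compounds[i] & compounds[j]) for j in members if j != i)
                observed += 2.0 / (n * (n - 1)) * shared
        observed /= n_recipes

        # Expected term: frequency-weighted mean overlap with every ingredient j
        # of the cuisine (the sum runs over all j, as the formula is written).
        mean_overlap = sum(freq[j] * len(compounds[i] & compounds[j]) for j in freq) / total_freq
        expected = 2.0 * freq[i] / (n_recipes * mean_size) * mean_overlap

        chi[i] = observed - expected
    return chi


# Hypothetical toy cuisine.
compounds = {"a": {1, 2}, "b": {1, 2}, "c": {3}}
recipes = [["a", "b"], ["a", "c"]]
chi = individual_contributions(recipes, compounds)
print(chi)
```

Sorting ingredients by this score, from most positive to most negative, would yield a ranking of the kind reported for each cuisine.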



We list in Table S5 the top contributors in North American and East Asian cuisines. 

            North American                  East Asian 
            Ingredient i      $\chi_i$      Ingredient i           $\chi_i$ 
Positive    milk              0.529         rice                   0.294 
            butter            0.511         red bean               0.152 
            cocoa             0.377         milk                   0.055 
            vanilla           0.239         green tea              0.041 
            cream             0.154         butter                 0.041 
            cream cheese      0.154         peanut                 0.038 
            egg               0.151         mung bean              0.036 
            peanut butter     0.136         [illegible]            0.033 
            strawberry        0.106         brown rice             0.031 
            cheddar cheese    0.098         nut                    0.024 
            orange            0.095         mushroom               0.022 
            lemon             0.095         orange                 0.016 
            coffee            0.085         soybean                0.015 
            cranberry         0.070         cinnamon               0.014 
            lime              0.065         enokidake              0.013 
Negative    tomato            -0.168        beef                   -0.2498 
            white wine        -0.0556       ginger                 -0.1032 
            beef              -0.0544       pork                   -0.0987 
            onion             -0.0524       cayenne                -0.0686 
            chicken           -0.0498       chicken                -0.0662 
            tamarind          -0.0427       onion                  -0.0541 
            vinegar           -0.0396       fish                   -0.0458 
            pepper            -0.0356       bell pepper            -0.0414 
            pork              -0.0332       roasted sesame seed    -0.0410 
            celery            -0.0329       black pepper           -0.0409 
            bell pepper       -0.0306       shrimp                 -0.0408 
            red wine          -0.0271       shiitake               -0.0329 
            black pepper      -0.0248       garlic                 -0.0302 
            parsley           -0.0217       carrot                 -0.0261 
            parmesan cheese   -0.0197       tomato                 -0.0246 

Table S5: Top 15 (both positive and negative) contributing ingredients to each cuisine. 